OLERA: A Semi-supervised Approach for Web Data Extraction with Visual Support

Authors

  • Chia-Hui Chang
  • Shih-Chien Kuo
Abstract

Information extraction (IE) from semi-structured Web documents plays an important role in a variety of information agents. Over the past few years, researchers have developed a rich family of generic IE techniques based on supervised approaches, which learn extraction rules from user-labelled training examples. However, annotating training data can be expensive when thousands of data sources need to be wrapped. In this article, we introduce OLERA, a semi-supervised IE system that produces extraction rules without detailed annotation of the training documents. Instead, the user supplies as an example a rough segment that contains everything to be extracted for one record. OLERA is designed with visualization support, so that the discovered records are displayed in a spreadsheet-like table for schema assignment. The experiments show that OLERA performs well on program-generated Web pages with very few training pages and little user intervention.

Similar resources

OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents

The vast amount of online information available has led to renewed interest in information extraction (IE) systems that analyze input documents to produce a structured representation of selected information from the documents. Information extraction from semistructured documents has been studied extensively in recent years. Most research focuses on supervised learning approaches where targets must be ...

Web Page Classification Based on Uncorrelated Semi-Supervised Intra-View and Inter-View Manifold Discriminant Feature Extraction

Web page classification has attracted increasing research interest. It is intrinsically a multi-view and semi-supervised application, since web pages usually contain two or more types of data, such as text, hyperlinks and images, and unlabeled pages generally far outnumber labeled ones. Web page data is commonly high-dimensional. Thus, how to extract useful features from this kind of data ...

Data Analysis Project: Semi-Supervised Discovery of Named Entities and Relations from the Web

This project studies semi-supervised discovery of named entities, relational entities and prepositional phrase attachments within a read-the-web framework. The meanings of an entity can be improvised and updated faster in the internet world than in printed references. The main idea of this project is to study the feasibility of characterizing entities by web content directly. The approach is that cont...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Graph Based Semi-Supervised Approach For Information Extraction

Classification techniques deploy supervised labeled instances to train classifiers for various classification problems. However, labeled instances are limited, expensive, and time-consuming to obtain, due to the need for experienced human annotators. Meanwhile, a large amount of unlabeled data is usually easy to obtain. Semi-supervised learning addresses the problem of utilizing unlabeled data along...

Publication date: 2003